library(tidyverse) # For data manipulation and plotting
library(stats) # For basic statistical tests
library(multcomp) # For multiple comparison corrections
library(lme4) # For mixed-effects modeling
library(broom) # For tidying model outputs
library(data.table) # Efficient data handling
library(edgeR) # RNAseq data processing
library(limma) # RNAseq linear models
library(ggrepel) # Label points in scatterplot
library(kimma) # RNAseq linear mixed effects models
library(BIGpicture) # RNAseq plots
select <- dplyr::select
# NOTE --- REPLACE the <data_dir> FOLDER DESTINTATION AS APPROPRIATE
# data_dir <- '/home/processed_data'
data_dir <- "data/"
# These are already log2-CPM
ncts <- readr::read_csv(file.path(data_dir, "foreman_etal_counts.csv"))
meta <- readr::read_csv(file.path(data_dir, "foreman_etal_meta.csv"))
meta <- meta %>%
mutate(logCFU = log10(CFU + 1))Identify granuloma associated T cell genes and validate in single-cell data
Data
This TB Hackday script uses (pre-processed) RNA sequencing data from the following study to identify granuloma-associated T cell genes:
- Foreman et al. 2023 (J Exp Med) CD30 co-stimulation drives differentiation of protective T cells during Mycobacterium tuberculosis infection
Then the expression of these genes can be further examined and validated using pseudo-bulk transcriptomic data from scRNAseq data in this study:
- Bromley et al. 2024 (Immunity) CD4+ T cells re-wire granuloma cellularity and regulatory networks to promote immunomodulation following Mtb reinfection
Background
Foreman et al. compared the gene expression of T cells isolated from PBMC vs. T cells from granulomas (n=4 NHP, n=23 granulomas). They identified genes that were associated with granuloma T cells, and specifically identified genes that were correlated with Mtb burden in the granuloma (CFU).
Bromley et al. report on an experiment with three groups of cynomolgus macaques: (1) anti-CD4 treated, Mtb-exposed (n=7), (2) IgG control, Mtb-exposed (n=6), (3) No treatment, Mtb-naive (n=6). Groups (2) and (3) were infected with Mtb, then given an anti-CD4 antibody to deplete CD4+ T cells or an isotype control (IgG) antibody, and finally challenged with a secondary Mtb infection. Group 3 only received a primary Mtb infection. Granulomas were then analyzed using single-cell RNAseq, with 3 NHP from (1) and 2 NHP from (2) and (3) each. We have created pseudo-bulk datasets using two different clustering of the cells, offering two levels of cell type granularity for analysis: bromley_X_pseudobulk_counts.csv where X is coarse or subclustering, with clusters defined by the authors of the manuscript. By analyzing data from primary infection, reinfection, and reinfection-CD4+ T cell-depleted granulomas, they found that the presence of CD4+ T cells during reinfection resulted in a less inflammatory lung milieu characterized by reprogrammed CD8+ T cells, reduced neutrophilia, and blunted type 1 immune signaling among myeloid cells.
Hypotheses for hacking
- Identify genes that are associated with CD4 or CD8 T cells in granulomas and validate their expression in the clusters of granuloma T cells in the Bromley et al. study
- Are there genes associated with low-CFU granulomas in Foreman et al. that are also associated with low-CFU granulomas in Bromley et al.?
Setup R and load the data.
Load relevant packages. Change <data_dir> variable as appropriate.
Load the pre-processed RNA sequencing data. There are several important files:
foreman_et_al_counts.csvcontains log2-transformed counts per million (log2-CPM), that were computed from raw counts by the study authors. The table contains 84 columns, with one columngene_idand the remaining columns matchingsampleids in the metadata. There are 30,689 genes in the dataset.foreman_etal_meta.csvcontains all the 83 sample- and granuloma-level metadata that is available for these samples and animals. There are 27 CD8+ T cell samples sorted from granulomas and 31 CD4+ T cell samples sorted from granulomas. Other variables includesubject,sort,condition,sex, and granulomaCFU.bromley_coarse_pseudobulk_counts.csvis a long-form table containing counts for 33 granulomas from 7 NHP, with over 25K genes from 15 coarse clusters of cells from the granulomas (44 sub-clusters). There are columns forbiosample_id,CoarseClustering,countsandgene. Counts are not log-normalized. Pseudo-bulk counts were created by summing the counts across all cells in a cluster for each sample.bromley_coarse_pseudobulk_counts.meta.csvcontains meta-data that can be joined to the counts data onbiosample_id. Variables includedonor_id,Group,CFU(for the granuloma), andCoarseClustering/SubclusteringV2.
Identify genes that are associated with granuloma T cells
This analysis should match part of the analysis presented in Foremant et al. Figure 1. It’s a bit of a strange statistical test because it compares gene expression in each granuloma T cell sort (n=31/27 for CD4/CD8) to gene expression in each PBMC sample (n=4), but we can use a rank-based test to find the genes that are differentially expressed in the two tissues. These will become candidate genes for downstream analysis, so statistical significance is not critical.
Steps for analysis: 1. Filter genes, keeping those that have at least 1 log-normalized count in 60% of the samples. 2. Filter to just the CD4 granuloma and PBMC samples (could alternatively focus on CD8) 3. Prepare the data as a matrix for computationally efficient testing 4. Apply Mann-Whitney test across all genes and collect the results 5. Print the top 20 results with an estimated FDR q-value
# Section 4: Filter genes for analysis
# ------------------------------------
ltpm <- ncts %>%
column_to_rownames("gene_id") %>%
filter(rowMeans(. > 1) > 0.6)
genes_filter <- rownames(ltpm)
# Merge metadata and log-transformed counts
meta_ss <- meta %>%
filter(condition %in% c("CD4_gran", "CD4_PBMC")) %>%
column_to_rownames("sampleid")
# Ensure `ltpm` only includes columns for the filtered samples
ltpm_ss <- ltpm %>%
select(all_of(rownames(meta_ss)))
mdf <- meta_ss %>%
rownames_to_column("sampleid") %>%
bind_cols(t(ltpm_ss)) # Ensure samples (columns of ltpm) match rows of meta
# Section 5: Mann-Whitney U Test for Gene Associations
# ----------------------------------------------------
# Convert mdf to a matrix for gene expression data
expression_data <- as.matrix(mdf %>% select(all_of(genes_filter)))
conditions <- mdf$condition
# Precompute the grouping indices for faster sub-setting
gran_idx <- which(conditions == "CD4_gran")
pbmc_idx <- which(conditions == "CD4_PBMC")
# Vectorized function to compute statistics for all genes
collect_res_matrix <- function(expression_data, gran_idx, pbmc_idx, genes) {
gran_expr <- expression_data[gran_idx, , drop = FALSE]
pbmc_expr <- expression_data[pbmc_idx, , drop = FALSE]
# Perform Wilcoxon tests in bulk
pvalues <- apply(expression_data, 2, function(gene_expr) {
wilcox.test(gene_expr[gran_idx], gene_expr[pbmc_idx])$p.value
})
# Calculate mean differences
mean_gran <- colMeans(gran_expr, na.rm = TRUE)
mean_pbmc <- colMeans(pbmc_expr, na.rm = TRUE)
mean_diff <- mean_gran - mean_pbmc
# Assign "GRAN" or "PBMC" based on the mean difference
assoc <- ifelse(mean_diff > 0, "GRAN", "PBMC")
# Combine results into a data frame
results <- data.frame(
gene = genes,
pvalue = pvalues,
assoc = assoc,
stringsAsFactors = FALSE
)
return(results)
}
# Run the matrix-optimized function
results_df <- collect_res_matrix(expression_data, gran_idx, pbmc_idx, genes_filter)
# Adjust p-values for multiple comparisons
results_df <- results_df %>%
mutate(FDRq = p.adjust(pvalue, method = "fdr")) %>%
arrange(pvalue)
# Other genes of interest from the manuscript
cd4_genes <- c("KLRB1", "CD40LG", "S100A11", "S100A4", "IL26", "BATF")
cd8_genes <- c("APOBEC3G", "IFNG", "TNF", "CCL1", "CCL20")
gran_genes <- results_df %>%
filter(FDRq < 0.1) %>%
filter(assoc == 'GRAN') %>%
pull(gene)
print(head(results_df, 20)) gene pvalue assoc FDRq
APOBEC3H APOBEC3H 3.81971e-05 GRAN 0.007986249
CD2 CD2 3.81971e-05 GRAN 0.007986249
CHMP4A CHMP4A 3.81971e-05 GRAN 0.007986249
CXCR6 CXCR6 3.81971e-05 GRAN 0.007986249
DNAJB1 DNAJB1 3.81971e-05 GRAN 0.007986249
EEF1G EEF1G 3.81971e-05 PBMC 0.007986249
FAM65B FAM65B 3.81971e-05 PBMC 0.007986249
FTL FTL 3.81971e-05 PBMC 0.007986249
GAPDH GAPDH 3.81971e-05 GRAN 0.007986249
GIMAP7 GIMAP7 3.81971e-05 GRAN 0.007986249
IL2RA IL2RA 3.81971e-05 GRAN 0.007986249
ITGAE ITGAE 3.81971e-05 GRAN 0.007986249
LOC100423954 LOC100423954 3.81971e-05 GRAN 0.007986249
LOC100425072 LOC100425072 3.81971e-05 PBMC 0.007986249
LOC100426537 LOC100426537 3.81971e-05 GRAN 0.007986249
LOC106993126 LOC106993126 3.81971e-05 GRAN 0.007986249
LOC694538 LOC694538 3.81971e-05 GRAN 0.007986249
LOC695218 LOC695218 3.81971e-05 PBMC 0.007986249
LOC702809 LOC702809 3.81971e-05 PBMC 0.007986249
LOC703774 LOC703774 3.81971e-05 PBMC 0.007986249
Fit the model to identify genes associated with protection (lower CFU)
Now we will test the granuloma-associated genes to see if they are associated with protection.
The sampleid columns of the ncts variable and the rows of sampleid in the meta variable match. For this first analysis we will focus on the CD8 T cells sorted from PBMC or granulomas, creating subset tables indicated by _ss variable. Then we initialize the DGEList object and create a limma-voom model with a design matrix to identify genes that are associated with granuloma Mtb burden (CFU).
In the accompanying “mean-variance” plot the x-axis represents the average expression levels of genes across all samples. The y-axis represents the square-root of the variance (i.e., standard deviation) of gene expression levels. It shows how the variance changes with respect to the mean expression. Every dot is a gene and the trend line shows the relationship between the mean and the variance. Note that the variance is relatively stable across expression levels and the relationship is smooth; this is good for analysis and voom will use this relationship to adjust the model fits of each gene. If you re-run the code block without the filtering you will see the impact on the mean-variance plot.
meta_ss = meta %>% filter(condition == "CD4_gran")
keep_ids = meta_ss %>% pull(sampleid)
keep_ids = c('gene_id', keep_ids)
ncts_ss = ncts %>% dplyr::select(any_of(keep_ids))
# Keep only the genes from
ncts_ss <- ncts_ss %>%
filter(gene_id %in% gran_genes)
# Move gene ID to rownames
ncts_ss_mat <- as.matrix(ncts_ss[,-1])
rownames(ncts_ss_mat) = ncts_ss$gene_id
# FOR TESTING ALL GENES: alternatively, discard genes that have low counts/prevalence
# filter = rowSums(ncts_ss > 1) >= (0.5 * ncol(ncts_ss))
# ncts_ss = ncts_ss[filter, ]
# Create the object for differential expression testing
dge_o = DGEList(counts=ncts_ss_mat,
genes=rownames(ncts_ss_mat),
samples=meta_ss,
group=meta_ss[['logCFU']])
# Specify the model/design matrix
design_temp = model.matrix(~logCFU, data=dge_o$samples)
# Create the voom object and fit the model
v <- voomWithQualityWeights(dge_o, design=design_temp, plot=TRUE)#Fit model
fit = lmFit(v, design_temp)
# Estimate contrasts and p-values
fit = eBayes(fit, robust=TRUE)
summary(decideTests(fit, adjust.method="fdr", p.value = 0.2)) (Intercept) logCFU
Down 0 33
NotSig 0 240
Up 298 25
results <- topTable(fit, adjust="BH", coef="logCFU", p.value=1, number=Inf, resort.by="P")
head(results %>% dplyr::select(genes, logFC, AveExpr, P.Value, adj.P.Val), 20) genes logFC AveExpr P.Value adj.P.Val
SYTL2 SYTL2 0.11492280 11.41220 0.0004588508 0.07600388
ASH1L ASH1L 0.11836029 11.51505 0.0006027642 0.07600388
KLRB1 KLRB1 -0.18640544 11.59983 0.0007651397 0.07600388
PSMA2 PSMA2 -0.04935484 12.30006 0.0024829731 0.10056038
CHD1 CHD1 0.08975339 11.45801 0.0026723083 0.10056038
LOC100426632 LOC100426632 -0.08040552 12.18295 0.0030091584 0.10056038
LOC100430627 LOC100430627 -0.06177910 12.35178 0.0035021669 0.10056038
FOSB FOSB 0.15424175 11.27775 0.0038699198 0.10056038
PRDX1_1 PRDX1_1 -0.04523569 12.13187 0.0039487373 0.10056038
TSPAN5 TSPAN5 0.10415933 11.48621 0.0047585952 0.10056038
PHF20 PHF20 0.07334830 11.60675 0.0050061585 0.10056038
RRBP1 RRBP1 0.16407615 11.17417 0.0051004715 0.10056038
LOC100423954 LOC100423954 -0.10162656 12.24038 0.0054423765 0.10056038
DNAJB1 DNAJB1 0.08412195 12.36983 0.0058819762 0.10056038
LOC721878 LOC721878 -0.08548157 11.63150 0.0061394561 0.10056038
RARRES3 RARRES3 -0.04687187 12.35166 0.0063386579 0.10056038
MYL6 MYL6 -0.02998800 12.55174 0.0065410830 0.10056038
TRAF1 TRAF1 0.10035311 11.61048 0.0066815712 0.10056038
RNF4 RNF4 0.06467035 11.81728 0.0070282116 0.10056038
SERPINB9 SERPINB9 0.05040778 11.87255 0.0073799232 0.10056038
Create a volcano plot for single-gene association with protection
# Add a column for significance based on FDR
results <- results %>%
mutate(Significance = ifelse(adj.P.Val < 0.2, "Significant", "Not Significant"))
# Select the top 10 genes based on adjusted p-value for labeling
top_genes <- results %>%
arrange(adj.P.Val) %>%
slice_head(n = 10)
max_logFC <- max(abs(results$logFC), na.rm = TRUE)
# Create the volcano plot
volcano_plot <- ggplot(results, aes(x = logFC, y = -log10(P.Value))) +
geom_point(aes(color = Significance), alpha = 0.6) +
scale_color_manual(values = c("Significant" = "red", "Not Significant" = "grey")) +
geom_text_repel(data = top_genes,
aes(label = genes),
max.overlaps = 10,
box.padding = 0.3,
point.padding = 0.3,
segment.color = "grey50",
size = 3) +
xlim(c(-max_logFC, max_logFC)) +
theme_minimal() +
labs(
x = "log2 Estimate",
y = "-log10 P-value",
color = "FDR < 0.2") +
theme(plot.title = element_text(hjust = 0.5))
volcano_plotRedo the analysis using a mixed-effects model to incorporate repeated measures
Using kimma, perform the same analysis for CFU only now taking into account the within animal variability with multiple granulomas samples from the same animal. We will still focus on CD4+ cells from granulomas.
# Use the same voom object as with limma
klm <- kmFit(dat = v,
model = "~logCFU + (1|subject)",
run_lme = TRUE,
libraryID="sampleid",
patientID="subject",
use_weights = TRUE,
metrics = FALSE,
processors=1)lme/lmerel model: expression~logCFU+(1|subject)
Input: 31 libraries from 4 unique patients
Model: 31 libraries
Complete: 298 genes
Failed: 0 genes
summarise_kmFit(fdr = klm$lme)# A tibble: 3 × 7
variable `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
<fct> <int> <int> <int> <int> <int> <int>
1 (1 | subject) 30 36 44 53 57 67
2 logCFU 0 29 63 77 117 138
3 total (nonredund… 30 65 101 119 154 178
plot_volcano(model_result = klm,
model = "lme", variables = "logCFU",
y_cutoff = 0.2, label = 10)Visulize results for a single gene
# Scatter-plot of Gene Expression vs. log-CFU
gene <- "TNFRSF4"
ggplot(mdf, aes(x = logCFU, y = get(gene))) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x) +
labs(
title = paste("Correlation of", gene, "with logCFU"),
x = "logCFU",
y = paste("Expression of", gene)
) +
theme_minimal()Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
# Boxplot of Gene Expression by Condition
ggplot(mdf, aes(x = condition, y = get(gene))) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.7) +
labs(
title = paste("Expression of", gene, "by Condition"),
x = "Condition",
y = paste("Expression of", gene)
) +
theme_minimal()Use the Bromley et al. data to identify cells that express protective genes and validate their association with protection
We can use the Bromley
# Load Bromley data
brom <- read_csv(file.path(data_dir, "bromley_coarse_pseudobulk_counts.csv"))New names:
Rows: 11603648 Columns: 5
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): biosample_id, CoarseClustering, gene dbl (2): ...1, counts
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
bmeta <- read_csv(file.path(data_dir, "bromley_coarse_pseudobulk_counts.meta.csv"))New names:
Rows: 1251 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(12): ...1, biosample_id, donor_id, array number, Sample Name, Sample ty... dbl
(2): Infusion before 2nd Mtb infection anti CD4 or IgG, CFU Total
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
# Rename columns for consistency
bmeta <- bmeta %>%
rename(CFU = `CFU Total`) %>%
mutate(logCFU = log10(CFU + 1))
# Visualize distribution of logCFU
ggplot(bmeta, aes(x = logCFU)) +
geom_density(fill = "blue", alpha = 0.3) +
labs(title = "Distribution of logCFU", x = "logCFU", y = "Density") +
theme_minimal()# Define function to preprocess gene-specific data
prepare_ss <- function(gene) {
ss <- brom %>% filter(gene == !!gene) # Subset for the specific gene
tot <- ss %>%
group_by(biosample_id) %>%
summarize(tot = sum(counts), .groups = "drop")
ss <- ss %>%
left_join(tot, by = "biosample_id") %>%
left_join(bmeta, by = c("biosample_id", "CoarseClustering")) %>%
mutate(
lcpm = log2((counts + 0.01) / tot),
cpm = counts / tot
)
return(ss)
}
# Perform Spearman correlation for CD4 and CD8 genes
cd4_genes <- c("TNFRSF4", "KLRB1", "CD40LG", "S100A11", "S100A4", "IL26", "BATF")
cd8_genes <- c("APOBEC3G", "IFNG", "TNF", "CCL1", "CCL20")
gene_list <- c(cd4_genes, cd8_genes)
res <- lapply(gene_list, function(g) {
ss <- prepare_ss(g)
unique_clusters <- unique(ss$CoarseClustering)
cluster_res <- lapply(unique_clusters, function(clust) {
tmp <- ss %>% filter(CoarseClustering == clust)
# Calculate Spearman correlation
cor_res <- cor.test(tmp$cpm, tmp$logCFU, method = "spearman")
data.frame(
gene = g,
cluster = clust,
n = nrow(tmp),
rho = cor_res$estimate,
pvalue = cor_res$p.value
)
})
do.call(rbind, cluster_res)
})
# Combine results into a single data frame
res_df <- do.call(rbind, res) %>%
arrange(pvalue) %>%
mutate(FDRq = p.adjust(pvalue, method = "fdr"))
print(res_df) gene cluster n rho pvalue
rho99 TNF T,NK cells 348 -0.619029412 4.357425e-38
rho95 IL26 T,NK cells 348 0.545054217 3.049259e-28
rho93 S100A11 T,NK cells 348 -0.437176233 1.121929e-17
rho97 APOBEC3G T,NK cells 348 -0.347051798 2.757506e-11
rho1311 CCL20 Alveolar Type 2 cells 79 0.623885558 8.130907e-10
rho69 TNF Neutrophils 61 0.635839221 3.671837e-08
rho14 TNFRSF4 Fibroblasts 98 0.493176174 2.479051e-07
rho411 CCL20 Endothelial cells 166 0.387103582 2.575956e-07
rho42 CD40LG Endothelial cells 166 0.369715775 9.472664e-07
rho67 APOBEC3G Neutrophils 61 0.541112026 6.714981e-06
rho81 KLRB1 Macrophages 185 -0.320543085 9.162651e-06
rho49 TNF Endothelial cells 166 0.335576469 9.880245e-06
rho33 S100A11 Mast cells 84 -0.441355616 2.641337e-05
rho124 S100A4 pDCs 31 0.678032014 2.776026e-05
rho34 S100A4 Mast cells 84 -0.429173719 4.631410e-05
rho83 S100A11 Macrophages 185 0.292789558 5.246485e-05
rho66 BATF Neutrophils 61 0.488820089 6.411343e-05
rho132 CD40LG Alveolar Type 2 cells 79 0.432988366 6.732569e-05
rho137 APOBEC3G Alveolar Type 2 cells 79 0.410070803 1.746068e-04
rho37 APOBEC3G Mast cells 84 -0.396256246 1.904979e-04
rho35 IL26 Mast cells 84 -0.394911153 2.202207e-04
rho147 APOBEC3G Fibroblasts 98 0.364119481 2.280592e-04
rho104 S100A4 cDC 1 32 0.588558895 3.952432e-04
rho113 S100A11 cDC 2 32 0.597507331 3.963332e-04
rho94 S100A4 T,NK cells 348 0.182228433 6.358543e-04
rho87 APOBEC3G Macrophages 185 0.245077236 7.731254e-04
rho142 CD40LG Fibroblasts 98 0.334016379 7.759048e-04
rho1310 CCL1 Alveolar Type 2 cells 79 0.364754670 1.197215e-03
rho4 TNFRSF4 Endothelial cells 166 0.247966873 1.276784e-03
rho96 BATF T,NK cells 348 -0.169069070 1.548633e-03
rho13 TNFRSF4 Alveolar Type 2 cells 79 0.349917826 1.571533e-03
rho92 CD40LG T,NK cells 348 -0.168004899 1.685766e-03
rho10 TNFRSF4 cDC 1 32 0.527465166 1.920709e-03
rho1210 TNF pDCs 31 0.521775839 2.608582e-03
rho133 S100A11 Alveolar Type 2 cells 79 0.334149819 2.615340e-03
rho36 BATF Mast cells 84 -0.317813745 3.219631e-03
rho9 TNFRSF4 T,NK cells 348 -0.157131746 3.338380e-03
rho84 S100A4 Macrophages 185 -0.214634308 3.348127e-03
rho1411 CCL20 Fibroblasts 98 0.293193284 3.390691e-03
rho3 TNFRSF4 Mast cells 84 -0.314944618 3.732179e-03
rho88 IFNG Macrophages 185 -0.211097666 3.922178e-03
rho145 IL26 Fibroblasts 98 -0.283053101 4.740934e-03
rho143 S100A11 Fibroblasts 98 0.278794650 5.438269e-03
rho148 IFNG Fibroblasts 98 0.278171967 5.547530e-03
rho410 CCL1 Endothelial cells 166 -0.214360931 6.159689e-03
rho131 KLRB1 Alveolar Type 2 cells 79 0.303707969 6.508873e-03
rho46 BATF Endothelial cells 166 0.208168374 7.116471e-03
rho144 S100A4 Fibroblasts 98 0.266285121 8.042055e-03
rho107 APOBEC3G cDC 1 32 0.461143695 8.484726e-03
rho106 BATF cDC 1 32 0.454962584 8.889571e-03
rho118 BATF B cells 32 -0.453445748 9.767368e-03
rho138 IFNG Alveolar Type 2 cells 79 0.289064485 9.774134e-03
rho27 BATF Alveolar Type 1 cells 28 0.469311200 1.175250e-02
rho135 IL26 Alveolar Type 2 cells 79 0.275283431 1.407222e-02
rho146 BATF Fibroblasts 98 0.245590285 1.479002e-02
rho1110 APOBEC3G cDC 2 32 0.428885630 1.501990e-02
rho98 IFNG T,NK cells 348 -0.127738396 1.711956e-02
rho86 BATF Macrophages 185 -0.174481970 1.753106e-02
rho910 CCL1 T,NK cells 348 0.125658451 2.122753e-02
rho119 BATF cDC 2 32 0.403962746 2.185115e-02
rho103 S100A11 cDC 1 32 0.395161290 2.595561e-02
rho126 BATF pDCs 31 0.388141080 3.095359e-02
rho127 APOBEC3G pDCs 31 0.387337435 3.133313e-02
rho89 TNF Macrophages 185 -0.155955130 3.451405e-02
rho130 TNF B cells 32 -0.373670726 3.514415e-02
rho136 BATF Alveolar Type 2 cells 79 0.231759664 3.986321e-02
rho8 TNFRSF4 Macrophages 185 0.150409510 4.155409e-02
rho139 TNF Alveolar Type 2 cells 79 0.229269765 4.210581e-02
rho102 CD40LG cDC 1 32 0.361165813 4.226439e-02
rho121 KLRB1 pDCs 31 0.355726509 4.953151e-02
rho810 CCL1 Macrophages 185 -0.146957887 5.028982e-02
rho110 S100A11 B cells 32 -0.348240469 5.143618e-02
rho1212 CCL20 pDCs 31 0.347807732 5.520287e-02
rho149 TNF Fibroblasts 98 0.188905939 6.247782e-02
rho64 S100A4 Neutrophils 61 0.235993881 6.710110e-02
rho101 KLRB1 cDC 1 32 0.321917368 7.237372e-02
rho6 TNFRSF4 Neutrophils 61 0.231361895 7.280692e-02
rho1010 CCL1 cDC 1 32 0.323753733 7.561581e-02
rho123 S100A11 pDCs 31 0.318145161 8.150861e-02
rho7 TNFRSF4 Plasma cells 31 0.316834483 8.245322e-02
rho91 KLRB1 T,NK cells 348 0.092370362 8.577191e-02
rho710 CCL1 Plasma cells 31 -0.317411102 8.741815e-02
rho47 APOBEC3G Endothelial cells 166 -0.132876041 8.789127e-02
rho31 KLRB1 Mast cells 84 -0.185048387 9.397315e-02
rho63 S100A11 Neutrophils 61 0.216162578 9.428733e-02
rho22 CD40LG Ciliated cells 17 0.408248290 1.037709e-01
rho1011 CCL20 cDC 1 32 0.291706448 1.052446e-01
rho134 S100A4 Alveolar Type 2 cells 79 0.178182350 1.161648e-01
rho122 CD40LG pDCs 31 0.284856107 1.203719e-01
rho54 S100A4 Eosinophils 27 -0.305430069 1.213212e-01
rho48 IFNG Endothelial cells 166 0.120416171 1.222594e-01
rho53 S100A11 Eosinophils 27 -0.304639805 1.223867e-01
rho12 TNFRSF4 pDCs 31 0.275534794 1.335409e-01
rho75 IL26 Plasma cells 31 -0.265361389 1.490859e-01
rho115 S100A4 cDC 2 32 0.256231672 1.565194e-01
rho17 CD40LG Alveolar Type 1 cells 28 0.267752418 1.683511e-01
rho30 IFNG Alveolar Type 1 cells 28 0.267752418 1.683511e-01
rho85 IL26 Macrophages 185 -0.099985119 1.768852e-01
rho15 KLRB1 Alveolar Type 1 cells 28 -0.259971697 1.815396e-01
rho TNFRSF4 Alveolar Type 1 cells 28 0.258144262 1.847389e-01
rho1410 CCL1 Fibroblasts 98 -0.137277155 1.870356e-01
rho76 BATF Plasma cells 31 0.237929057 1.974338e-01
rho150 CCL20 B cells 32 0.232610000 2.001418e-01
rho11 TNFRSF4 cDC 2 32 0.230838380 2.036874e-01
rho610 CCL1 Neutrophils 61 0.165968326 2.090128e-01
rho125 IL26 pDCs 31 0.230971146 2.112511e-01
rho141 KLRB1 Fibroblasts 98 -0.120579635 2.369349e-01
rho43 S100A11 Endothelial cells 166 0.092134761 2.377551e-01
rho129 IFNG pDCs 31 0.215565922 2.441592e-01
rho112 CD40LG cDC 2 32 0.211126824 2.460776e-01
rho311 CCL1 Mast cells 84 -0.128549316 2.467858e-01
rho52 CD40LG Eosinophils 27 0.227128381 2.545711e-01
rho44 S100A4 Endothelial cells 166 0.086444540 2.681131e-01
rho41 KLRB1 Endothelial cells 166 0.085832805 2.715272e-01
rho39 TNF Alveolar Type 1 cells 28 0.214460072 2.731269e-01
rho109 TNF cDC 1 32 0.197576536 2.783987e-01
rho114 S100A4 B cells 32 -0.196480938 2.798931e-01
rho1114 CCL20 cDC 2 32 0.189790134 2.981568e-01
rho611 CCL20 Neutrophils 61 0.131505064 3.123828e-01
rho911 CCL20 T,NK cells 348 -0.053358059 3.216513e-01
rho1 TNFRSF4 B cells 32 0.178812645 3.274838e-01
rho108 IFNG cDC 1 32 0.176823479 3.329817e-01
rho23 S100A11 Ciliated cells 17 0.233650087 3.667548e-01
rho57 APOBEC3G Eosinophils 27 0.180014364 3.689232e-01
rho140 CCL1 B cells 32 -0.163299316 3.800729e-01
rho1112 TNF cDC 2 32 0.158503094 3.862424e-01
rho120 APOBEC3G B cells 32 0.155791789 3.929650e-01
rho210 APOBEC3G Ciliated cells 17 -0.215091402 4.070683e-01
rho811 CCL20 Macrophages 185 -0.060011666 4.183886e-01
rho50 CCL20 Alveolar Type 1 cells 28 0.154906274 4.312326e-01
rho68 IFNG Neutrophils 61 -0.101723839 4.353444e-01
rho73 S100A11 Plasma cells 31 -0.140665946 4.503841e-01
rho20 S100A4 Alveolar Type 1 cells 28 0.148189362 4.517123e-01
rho16 KLRB1 B cells 32 -0.136979577 4.547141e-01
rho79 TNF Plasma cells 31 0.130578295 4.838279e-01
rho56 BATF Eosinophils 27 -0.139949732 4.862859e-01
rho1211 CCL1 pDCs 31 -0.114811440 5.457568e-01
rho214 CCL20 Ciliated cells 17 0.151486360 5.616548e-01
rho711 CCL20 Plasma cells 31 0.105993461 5.703762e-01
rho61 KLRB1 Neutrophils 61 0.074016215 5.707792e-01
rho74 S100A4 Plasma cells 31 0.101352538 5.874620e-01
rho19 S100A11 Alveolar Type 1 cells 28 -0.105927951 5.916296e-01
rho28 BATF Ciliated cells 17 0.139895844 5.922933e-01
rho511 CCL20 Eosinophils 27 -0.103605389 6.070682e-01
rho51 KLRB1 Eosinophils 27 -0.101330998 6.150245e-01
rho38 IFNG Mast cells 84 0.054029952 6.254581e-01
rho59 TNF Eosinophils 27 0.096019877 6.337692e-01
rho1111 IFNG cDC 2 32 0.082539708 6.533551e-01
rho58 IFNG Eosinophils 27 -0.087357070 6.648174e-01
rho82 CD40LG Macrophages 185 -0.030964460 6.764882e-01
rho26 IL26 Ciliated cells 17 0.102062073 6.966992e-01
rho116 IL26 B cells 32 -0.068083207 7.112008e-01
rho105 IL26 cDC 1 32 -0.068083207 7.112008e-01
rho18 CD40LG B cells 32 0.064950514 7.239603e-01
rho71 KLRB1 Plasma cells 31 0.061648897 7.418121e-01
rho212 TNF Ciliated cells 17 0.084753795 7.463818e-01
rho32 CD40LG Mast cells 84 -0.035855065 7.476006e-01
rho72 CD40LG Plasma cells 31 -0.056971221 7.608122e-01
rho510 CCL1 Eosinophils 27 0.057867175 7.743430e-01
rho24 S100A4 Ciliated cells 17 0.071345695 7.855408e-01
rho78 IFNG Plasma cells 31 0.041650535 8.239504e-01
rho310 TNF Mast cells 84 0.021566239 8.465513e-01
rho1113 CCL1 cDC 2 32 -0.031109769 8.680516e-01
rho312 CCL20 Mast cells 84 -0.015969259 8.860615e-01
rho29 APOBEC3G Alveolar Type 1 cells 28 0.026705947 8.926945e-01
rho111 KLRB1 cDC 2 32 -0.023492377 8.984473e-01
rho128 IFNG B cells 32 0.013993298 9.394094e-01
rho77 APOBEC3G Plasma cells 31 -0.009880028 9.579309e-01
rho62 CD40LG Neutrophils 61 -0.003575747 9.781807e-01
rho21 KLRB1 Ciliated cells 17 0.006586363 9.799849e-01
rho5 TNFRSF4 Eosinophils 27 0.000000000 1.000000e+00
rho2 TNFRSF4 Ciliated cells 17 NA NA
rho25 IL26 Alveolar Type 1 cells 28 NA NA
rho45 IL26 Endothelial cells 166 NA NA
rho55 IL26 Eosinophils 27 NA NA
rho65 IL26 Neutrophils 61 NA NA
rho117 IL26 cDC 2 32 NA NA
rho211 IFNG Ciliated cells 17 NA NA
rho40 CCL1 Alveolar Type 1 cells 28 NA NA
rho213 CCL1 Ciliated cells 17 NA NA
FDRq
rho99 7.451197e-36
rho95 2.607116e-26
rho93 6.394996e-16
rho97 1.178834e-09
rho1311 2.780770e-08
rho69 1.046474e-06
rho14 5.506106e-06
rho411 5.506106e-06
rho42 1.799806e-05
rho67 1.148262e-04
rho81 1.407935e-04
rho49 1.407935e-04
rho33 3.390718e-04
rho124 3.390718e-04
rho34 5.279807e-04
rho83 5.607180e-04
rho66 6.395940e-04
rho132 6.395940e-04
rho137 1.571462e-03
rho37 1.628757e-03
rho35 1.772642e-03
rho147 1.772642e-03
rho104 2.823874e-03
rho113 2.823874e-03
rho94 4.349244e-03
rho87 4.914064e-03
rho142 4.914064e-03
rho1310 7.311565e-03
rho4 7.528621e-03
rho96 8.668779e-03
rho13 8.668779e-03
rho92 9.008311e-03
rho10 9.952764e-03
rho1210 1.277780e-02
rho133 1.277780e-02
rho36 1.486687e-02
rho9 1.486687e-02
rho84 1.486687e-02
rho1411 1.486687e-02
rho3 1.595507e-02
rho88 1.635835e-02
rho145 1.930237e-02
rho143 2.155972e-02
rho148 2.155972e-02
rho410 2.340682e-02
rho131 2.419603e-02
rho46 2.589184e-02
rho144 2.864982e-02
rho107 2.960996e-02
rho106 3.040233e-02
rho118 3.214186e-02
rho138 3.214186e-02
rho27 3.791843e-02
rho135 4.456203e-02
rho146 4.586433e-02
rho1110 4.586433e-02
rho98 5.135868e-02
rho86 5.168639e-02
rho910 6.152386e-02
rho119 6.227577e-02
rho103 7.276082e-02
rho126 8.504707e-02
rho127 8.504707e-02
rho89 9.221722e-02
rho130 9.245614e-02
rho136 1.032820e-01
rho8 1.047422e-01
rho139 1.047422e-01
rho102 1.047422e-01
rho121 1.209984e-01
rho810 1.211206e-01
rho110 1.221609e-01
rho1212 1.293108e-01
rho149 1.443744e-01
rho64 1.529905e-01
rho101 1.616881e-01
rho6 1.616881e-01
rho1010 1.657731e-01
rho123 1.762437e-01
rho7 1.762437e-01
rho91 1.810740e-01
rho710 1.810772e-01
rho47 1.810772e-01
rho31 1.896839e-01
rho63 1.896839e-01
rho22 2.063352e-01
rho1011 2.068601e-01
rho134 2.257293e-01
rho122 2.274796e-01
rho54 2.274796e-01
rho48 2.274796e-01
rho53 2.274796e-01
rho12 2.455429e-01
rho75 2.712095e-01
rho115 2.817349e-01
rho17 2.967838e-01
rho30 2.967838e-01
rho85 3.086466e-01
rho15 3.135684e-01
rho 3.159035e-01
rho1410 3.166643e-01
rho76 3.309920e-01
rho150 3.322743e-01
rho11 3.349091e-01
rho610 3.403923e-01
rho125 3.407918e-01
rho141 3.764456e-01
rho43 3.764456e-01
rho129 3.801836e-01
rho112 3.801836e-01
rho311 3.801836e-01
rho52 3.886755e-01
rho44 4.057286e-01
rho41 4.061278e-01
rho39 4.061278e-01
rho109 4.090746e-01
rho114 4.090746e-01
rho1114 4.320747e-01
rho611 4.488862e-01
rho911 4.583531e-01
rho1 4.628077e-01
rho108 4.667203e-01
rho23 5.087570e-01
rho57 5.087570e-01
rho140 5.199397e-01
rho1112 5.241861e-01
rho120 5.291103e-01
rho210 5.438178e-01
rho811 5.546082e-01
rho50 5.672367e-01
rho68 5.682740e-01
rho73 5.802695e-01
rho20 5.802695e-01
rho16 5.802695e-01
rho79 6.114330e-01
rho56 6.114330e-01
rho1211 6.812001e-01
rho214 6.959636e-01
rho711 6.971660e-01
rho61 6.971660e-01
rho74 7.082668e-01
rho19 7.082668e-01
rho28 7.082668e-01
rho511 7.208935e-01
rho51 7.253048e-01
rho38 7.325570e-01
rho59 7.372417e-01
rho1111 7.548900e-01
rho58 7.629784e-01
rho82 7.711965e-01
rho26 7.889773e-01
rho116 7.948715e-01
rho105 7.948715e-01
rho18 8.038780e-01
rho71 8.142656e-01
rho212 8.142656e-01
rho32 8.142656e-01
rho72 8.234106e-01
rho510 8.327840e-01
rho24 8.395467e-01
rho78 8.751275e-01
rho310 8.935819e-01
rho1113 9.106554e-01
rho312 9.238812e-01
rho29 9.251561e-01
rho111 9.255089e-01
rho128 9.619102e-01
rho77 9.750368e-01
rho62 9.857495e-01
rho21 9.857495e-01
rho5 1.000000e+00
rho2 NA
rho25 NA
rho45 NA
rho55 NA
rho65 NA
rho117 NA
rho211 NA
rho40 NA
rho213 NA
# Extract the top gene and cluster
top_gene <- res_df$gene[1]
top_cluster <- res_df$cluster[1]
# Prepare data for visualization
ss <- prepare_ss(top_gene)
# Ensure Group has the specified ordering
ss <- ss %>%
mutate(Group = factor(Group, levels = c("IgG", "antiCD4", "Naïve")))
# Visualization: Boxplot with stripplot overlay
ggplot(ss, aes(x = cpm, y = CoarseClustering, color = Group)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(position = position_jitterdodge(jitter.width = 0.2), alpha = 0.7) +
labs(
title = paste("Expression of", top_gene, "by Coarse Clustering"),
x = paste(top_gene, "expression (log2-CPM)"),
y = "Coarse Clustering"
) +
theme_minimal()Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
# Visualization: Scatterplot of cpm vs logCFU for the top cluster
ss_top <- ss %>% filter(CoarseClustering == top_cluster)
ggplot(ss_top, aes(x = cpm, y = logCFU)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = paste("Scatterplot of", top_gene, "expression in", top_cluster),
x = paste(top_gene, "expression (log2-CPM)"),
y = "logCFU"
) +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).